ML Model Deployment

Model Serving

Deployment procedures

Deployment approaches

Batch Prediction Versus Online Prediction

What are they?

|  | Batch prediction (asynchronous) | Online prediction (synchronous) |
| --- | --- | --- |
| Frequency | Periodic, such as every four hours | As soon as requests come |
| Useful for | Processing accumulated data when you don’t need immediate results (such as recommender systems) | When predictions are needed as soon as a data sample is generated (such as fraud detection) |
| Optimized for | High throughput | Low latency |
| Cost | Pay for resources for the duration of the batch job | Pay for resources while the endpoint is running |
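To make the contrast concrete, here is a minimal Python sketch (not from the source; `model_predict`, `run_batch_job`, and `handle_request` are hypothetical stand-ins): batch prediction precomputes results over accumulated data on a schedule, while online prediction computes one result per incoming request.

```python
def model_predict(sample: dict) -> float:
    """Stand-in for a trained model's predict call (hypothetical)."""
    return sum(sample.values()) / len(sample)

# Batch prediction: run periodically (e.g., every four hours) over
# accumulated data; optimized for throughput, with results stored for
# later lookup (e.g., precomputed recommendations).
def run_batch_job(accumulated_samples: list[dict]) -> list[float]:
    return [model_predict(s) for s in accumulated_samples]

# Online prediction: respond to each request as soon as it arrives;
# optimized for latency (e.g., scoring a transaction for fraud).
def handle_request(sample: dict) -> float:
    return model_predict(sample)

if __name__ == "__main__":
    samples = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]
    print(run_batch_job(samples))      # scheduled job over the whole batch
    print(handle_request(samples[0]))  # synchronous, per request
```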

Unifying Batch Pipeline and Streaming Pipeline

Traditionally, the two pipelines are maintained by two separate teams: the ML team maintains the batch pipeline for training while the deployment team maintains the streaming pipeline for inference. Having two different code paths for the same data is a common source of bugs: if the two pipelines extract features differently, serving-time inputs no longer match what the model saw during training. Unifying the batch and streaming pipelines, for example by sharing the same feature-extraction code between them (see the sketch below), prevents this inconsistency.
![[Pasted image 20230711142806.png|400]]
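As a minimal sketch of that unification (assumed names, not the source's implementation): a single `extract_features` function is the only place feature logic lives, and both the batch training job and the streaming inference path call it.

```python
import math

# Shared feature logic: the single source of truth used by BOTH pipelines,
# so training-time and serving-time features cannot drift apart.
# All names here are hypothetical.
def extract_features(txn: dict) -> dict:
    """Turn a raw transaction record into model features."""
    return {
        "amount_log": math.log1p(txn["amount"]),
        "is_foreign": int(txn["country"] != "US"),
    }

# Batch pipeline (training): apply the shared function to accumulated rows.
def build_training_set(raw_rows: list[dict]) -> list[dict]:
    return [extract_features(r) for r in raw_rows]

# Streaming pipeline (inference): apply the SAME function to each live event.
def on_stream_event(txn: dict, predict) -> float:
    return predict(extract_features(txn))

if __name__ == "__main__":
    rows = [{"amount": 120.0, "country": "US"}, {"amount": 87.5, "country": "FR"}]
    print(build_training_set(rows))
    print(on_stream_event(rows[1], predict=lambda feats: feats["amount_log"]))
```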

How to accelerate ML model inference

There are three main approaches to reducing a model's inference latency: